Survey on Publicly Available Sinhala Natural Language Processing Tools and Research
Sinhala is the native language of the Sinhalese people who make up the
largest ethnic group of Sri Lanka. The language belongs to the globe-spanning
Indo-European language family. However, due to a poverty of both linguistic and
economic capital, Sinhala, from the perspective of Natural Language Processing
tools and research, remains a resource-poor language which has neither the
economic drive of its cousin English nor the sheer weight of numbers that
a language such as Chinese has. A number of research groups from Sri Lanka have
noticed this dearth and the resultant dire need for proper tools and research
for Sinhala natural language processing. However, due to various reasons, these
attempts seem to lack coordination and awareness of each other. The objective
of this paper is to fill that gap with a comprehensive literature survey of the
publicly available Sinhala natural language tools and research, so that
researchers working in this field can better utilize the contributions of their
peers. As such, we will upload this paper to arXiv and update it periodically to
reflect the advances made in the field.
Sinhala-English Parallel Word Dictionary Dataset
Parallel datasets are vital for performing and evaluating any kind of
multilingual task. However, in the cases where one of the considered language
pairs is a low-resource language, the existing top-down parallel data such as
corpora are lacking in both quantity and quality due to the dearth of human
annotation. Therefore, for low-resource languages, it is more feasible to move
in the bottom-up direction where finer granular pairs such as dictionary
datasets are developed first. They may then be used for mid-level tasks such as
supervised multilingual word embedding alignment. These in turn can later guide
higher-level tasks such as aligning sentence- or paragraph-level text corpora
used for Machine Translation (MT). Although building such dictionaries is more
approachable than generating and aligning a massive corpus for a low-resource
language, even these finer-grained data sets are lacking for some low-resource
languages, for the same reason: apathy from larger research entities. We have
observed that
there is no free and open dictionary data set for the low-resource language,
Sinhala. Thus, in this work, we introduce three parallel English-Sinhala word
dictionaries (En-Si-dict-large, En-Si-dict-filtered, En-Si-dict-FastText) which
help in multilingual Natural Language Processing (NLP) tasks related to English
and Sinhala languages. In this paper, we explain the dataset creation pipeline
as well as the experimental results of the tests we have carried out to verify
the quality of the data sets. The data sets and the related scripts are
available at https://github.com/kasunw22/sinhala-para-dict
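A minimal sketch of how such a word-dictionary file might be consumed downstream, assuming a tab-separated layout with one English-Sinhala pair per line; the file format and function names here are illustrative assumptions, not the actual dataset specification:

```python
import csv
from collections import defaultdict


def load_dictionary(path):
    """Load a tab-separated English-Sinhala word dictionary.

    One English headword may map to several Sinhala words, so all
    translations are collected per headword.
    """
    en_to_si = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for en, si in csv.reader(f, delimiter="\t"):
            en_to_si[en.strip().lower()].append(si.strip())
    return en_to_si


def translate(word, dictionary):
    """Return the known Sinhala translations of an English word."""
    return dictionary.get(word.lower(), [])
```

Such lookups can then seed supervised word-embedding alignment, where the dictionary pairs act as anchor points between the two embedding spaces.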
Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations
Case Law has a significant impact on the proceedings of legal cases.
Therefore, the information that can be obtained from previous court cases is
valuable to lawyers and other legal officials when performing their duties.
This paper describes a methodology of applying discourse relations between
sentences when processing text documents related to the legal domain. In this
study, we developed a mechanism to classify the relationships that can be
observed among sentences in transcripts of United States court cases. First, we
defined relationship types that can be observed between sentences in court case
transcripts. Then we classified pairs of sentences according to the
relationship type by combining a machine learning model and a rule-based
approach. The results obtained through our system were evaluated using human
judges. To the best of our knowledge, this is the first study where discourse
relationships between sentences have been used to determine relationships among
sentences in legal court case transcripts.
Comment: Conference: 2018 International Conference on Advances in ICT for
Emerging Regions (ICTer)
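The hybrid rule-plus-model approach described above can be sketched roughly as follows; the cue phrases and relation labels are generic illustrative assumptions, not the relationship types defined in the paper:

```python
# Illustrative cue-phrase rules; these labels are generic discourse
# relations, not the taxonomy defined in the paper.
CUE_RULES = {
    "however": "contrast",
    "therefore": "result",
    "because": "cause",
    "for example": "elaboration",
}


def rule_based_relation(sentence_b):
    """Return a relation label if the second sentence opens with a
    known cue phrase, else None (deferring to the ML model)."""
    text = sentence_b.lower().lstrip()
    for cue, relation in CUE_RULES.items():
        if text.startswith(cue):
            return relation
    return None


def classify_pair(sentence_a, sentence_b, ml_model):
    """Combine the rule-based check with a fallback ML classifier,
    mirroring the hybrid approach described above."""
    relation = rule_based_relation(sentence_b)
    if relation is not None:
        return relation
    return ml_model(sentence_a, sentence_b)
```

The design choice is that high-precision surface cues override the statistical model, while ambiguous pairs fall through to the learned classifier.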
Multi-document Summarization: A Comparative Evaluation
This paper is aimed at evaluating state-of-the-art models for Multi-document
Summarization (MDS) on different types of datasets in various domains and
investigating the limitations of existing models to determine future research
directions. To address this gap, we conducted an extensive literature review to
identify state-of-the-art models and datasets. We analyzed the performance of
PRIMERA and PEGASUS models on BigSurvey-MDS and MS datasets, which posed
unique challenges due to their varied domains. Our findings show that the
general-purpose pre-trained model LED outperforms PRIMERA and PEGASUS on the
MS dataset. We used the ROUGE score as a performance metric to evaluate the
identified models on different datasets. Our study provides valuable insights
into the models' strengths and weaknesses, as well as their applicability in
different domains. This work serves as a reference for future MDS research and
contributes to the development of accurate and robust models which can be
utilized on demanding datasets with academically and/or scientifically complex
data as well as on generalized, relatively simple datasets.
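As a rough illustration of the evaluation metric, a simplified whitespace-tokenised ROUGE-1 can be computed as below; production toolkits additionally apply stemming and report further variants such as ROUGE-2 and ROUGE-L:

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Compute unigram-overlap ROUGE-1 precision, recall and F1.

    A simplified, whitespace-tokenised version of the metric; the
    overlap is the multiset intersection of candidate and reference
    unigram counts.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```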
Sentiment Analysis with Deep Learning Models: A Comparative Study on a Decade of Sinhala Language Facebook Data
The relationship between Facebook posts and the corresponding reaction
feature is an interesting subject to explore and understand. To achieve this
end, we test state-of-the-art Sinhala sentiment analysis models against a data
set containing a decade worth of Sinhala posts with millions of reactions. For
the purpose of establishing benchmarks and with the goal of identifying the
best model for Sinhala sentiment analysis, we also test, on the same data set
configuration, other deep learning models catered for sentiment analysis. In
this study, we report that the 3-layer Bidirectional LSTM model achieves an F1
score of 84.58% for Sinhala sentiment analysis, surpassing the current
state-of-the-art model, Capsule B, which only manages an F1 score of
82.04%. Further, since all the deep learning models show F1 scores above 75%, we
conclude that it is safe to claim that Facebook reactions are suitable for
predicting the sentiment of a text.
Comment: 8 pages, LaTeX; typos corrected
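A minimal sketch of how a weak sentiment label might be derived from a post's reaction counts; the reaction-to-polarity mapping and the margin threshold are illustrative assumptions rather than the paper's actual labelling scheme:

```python
# Reaction-to-polarity mapping is an illustrative assumption; the
# paper's actual labelling scheme may differ.
POSITIVE = {"like", "love"}
NEGATIVE = {"sad", "angry"}


def reaction_sentiment(reactions, margin=0.1):
    """Derive a weak sentiment label for a post from its reaction
    counts: 'positive', 'negative', or 'neutral' when the polarised
    reactions are absent or too close to call."""
    pos = sum(reactions.get(r, 0) for r in POSITIVE)
    neg = sum(reactions.get(r, 0) for r in NEGATIVE)
    total = pos + neg
    if total == 0:
        return "neutral"
    score = (pos - neg) / total
    if score > margin:
        return "positive"
    if score < -margin:
        return "negative"
    return "neutral"
```

Labels produced this way can serve as distant supervision for training or benchmarking sentiment models on reaction-annotated posts.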